Gittins and Whittle's Index
Bandit Problem
Markov Bandit Process
- MDP on a countable state space $E$, where $x(t) \in E$ is the state of the bandit at the discrete decision time $t \in \{0, 1, 2, \dots\}$.
- Controls $u(t) \in \{0, 1\}$ applied at decision time $t$:
  - $u(t) = 0$ freezes the process and gives no reward;
  - $u(t) = 1$ continues the process and gives instantaneous reward $r(x(t))$.
- $r(\cdot)$ is the bounded reward function.
- $\beta \in (0, 1)$ is the discount factor.
Simple Family of Alternative Bandit Processes
- $N$ Markov Bandit Processes with state spaces $E_1, \dots, E_N$; only one bandit is "served" at each decision time, and the states of the others do not change.
- Control $u_i(t) = 1$ is applied to a single bandit $i$ at each decision time $t$.
- All other bandits remain in the same state.
- Other Notations
  - $i(t)$: sequence of selected bandits;
  - $x_{i(t)}(t)$: state of the selected bandit at each decision time $t$;
  - $P_i(x, y)$: transition probability of bandit $i$ from state $x$ to state $y$ when played.
Objective Function
- Problem: sequentially allocate effort between the $N$ different processes to maximize the infinite-horizon expected discounted sum of rewards:
$$\max_{\pi} \; \mathbb{E}\left[ \sum_{t=0}^{\infty} \beta^t \, r_{i(t)}\big(x_{i(t)}(t)\big) \right]$$
- At time $t$, we know the state $x_i(t)$, the transition probabilities $P_i$, the discount factor $\beta$, and the reward function $r_i$ for each bandit (see the simulation sketch below).
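As an illustrative sketch (not from the source), this setup can be simulated with finite-state bandits given by a transition matrix `P` and a reward vector `r`; the names `MarkovBandit`, `run_policy`, and `choose` are hypothetical:

```python
import numpy as np

class MarkovBandit:
    """Hypothetical finite-state Markov bandit: transition matrix P, reward vector r."""
    def __init__(self, P, r, rng):
        self.P = np.asarray(P, dtype=float)  # P[x, y]: prob. of x -> y when played
        self.r = np.asarray(r, dtype=float)  # r[x]: instantaneous reward in state x
        self.state = 0
        self.rng = rng

    def play(self):
        """u = 1: collect r(x(t)) and transition; u = 0 would leave the state frozen."""
        reward = self.r[self.state]
        self.state = self.rng.choice(len(self.r), p=self.P[self.state])
        return reward

def run_policy(bandits, choose, beta=0.9, horizon=200):
    """Truncated estimate of the discounted reward of a policy
    `choose(states) -> index of the bandit to play`; all other bandits stay frozen."""
    total = 0.0
    for t in range(horizon):
        i = choose([b.state for b in bandits])
        total += beta**t * bandits[i].play()
    return total
```

Any allocation rule plugs in through `choose`; with $\beta < 1$ the truncation error of the finite horizon decays geometrically.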
Examples
Gittins Index
Multi Armed Bandit Problem (open problem for almost 40 years)
Index Policy
An index policy computes an index $G_i(x_i)$ for each bandit and, at every decision time, serves the bandit whose current index is highest. Notice that the index depends only on the parameters associated with a single bandit. But how should such a function be designed?
The optimal policy for this problem is an index policy.
Derivation of the Index – Single bandit with charge
- Consider a single bandit $i$ with a “playing charge” of $\lambda$: we pay $\lambda$ each time we choose $u_i(t) = 1$ at time $t$.
- Then we get the Optimal Reward:
$$V_i(x_i, \lambda) = \max_{\tau > 0} \; \mathbb{E}\left[ \sum_{t=0}^{\tau - 1} \beta^t \big( r_i(x_i(t)) - \lambda \big) \,\middle|\, x_i(0) = x_i \right]$$
- For every state $x_i$ there is a charge $\lambda$ such that there is a null reward for playing: $V_i(x_i, \lambda) = 0$.
- This equation has a single root, which is the Gittins Index $G_i(x_i)$, given by $V_i\big(x_i, G_i(x_i)\big) = 0$.
- This $G_i(x_i)$ is called the fair charge in state $x_i$: the charge that makes it equally desirable to play and to stop (a numerical sketch follows).
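Numerically, the fair charge of a finite-state bandit can be approximated by solving the optimal stopping problem with value iteration and bisecting on $\lambda$; this is a sketch under that finite-state assumption, with hypothetical helper names:

```python
import numpy as np

def stopping_value(P, r, lam, beta=0.9, iters=500):
    """V(x) = max(0, r(x) - lam + beta * sum_y P[x, y] * V(y)):
    value of optimally playing-then-stopping under playing charge lam."""
    P, r = np.asarray(P, dtype=float), np.asarray(r, dtype=float)
    V = np.zeros(len(r))
    for _ in range(iters):
        V = np.maximum(0.0, r - lam + beta * P @ V)
    return V

def gittins_index(P, r, x, beta=0.9, tol=1e-6):
    """Bisect on the charge: G(x) is the root of V(x, lam) = 0."""
    r = np.asarray(r, dtype=float)
    lo, hi = float(r.min()), float(r.max())  # the fair charge lies in the reward range
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        if stopping_value(P, r, lam, beta)[x] > 0.0:
            lo = lam  # playing is still profitable: the fair charge is higher
        else:
            hi = lam  # null reward for playing: the fair charge is at most lam
    return 0.5 * (lo + hi)
```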
Gittins Index
Going back to the Simple Family of Alternative Bandit Processes, the fair charge can be written as
$$G_i(x_i) = \max_{\tau > 0} \; \frac{\mathbb{E}\left[ \sum_{t=0}^{\tau - 1} \beta^t \, r_i(x_i(t)) \,\middle|\, x_i(0) = x_i \right]}{\mathbb{E}\left[ \sum_{t=0}^{\tau - 1} \beta^t \,\middle|\, x_i(0) = x_i \right]}$$
where $\tau$ is the stopping time.
The numerator is the discounted REWARD up to time $\tau$; the denominator is the discounted TIME up to time $\tau$.
- GITTINS INDEX POLICY: choose the bandit with the highest $G_i(x_i(t))$ at every decision time $t$.
- “greatest per period rent that one would be willing to pay for ownership of the rewards arising from the bandit as it is continued for one or more periods.”
Intuition: since the index is the cost one pays to exactly break even (the fair charge), the larger the charge one is willing to pay for a bandit, the higher the reward it can be expected to yield.
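Combining the sketches above, a Gittins policy for `run_policy` could look as follows (illustrative, and deliberately naive: the index is recomputed at every step instead of being precomputed per state):

```python
def gittins_choose(bandits, beta=0.9):
    """Index policy: play the bandit whose current state has the highest
    Gittins index (in practice one would cache G_i(x) for every state)."""
    def choose(states):
        return max(range(len(bandits)),
                   key=lambda i: gittins_index(bandits[i].P, bandits[i].r,
                                               states[i], beta))
    return choose
```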
- Gittins Index Policy is optimal.
【Main ideas in the proof】
- We always choose the bandit with the largest current reward density.
- There is no “opportunity cost”, since the other bandits are frozen. (This breaks down when bandits are restless.)
This proof is instructive because:
- it provides insight into why the Gittins Index Policy is optimal;
- it provides insight into why it is NOT optimal in the restless case.
- Necessary Conditions for Gittins
- Control space is finite
- Infinite Horizon
- Constant exponential discounting
- Single processor/server
More details: @GittinsIndexMultiarmed
Whittle Index
Restless Multi Armed Bandit Problem
- Whittle extends the notion of index to restless bandits.
- At each time $t$, exactly $M$ out of the $N$ bandits are given the action $u_i(t) = 1$.
- Action $u_i(t) = 0$ no longer freezes the bandit (passive bandits may change state and accrue reward).
- At each time $t$, the activation constraint $\sum_{i=1}^{N} u_i(t) = M$ must hold.
Three Optimization Problems
- Original: maximize the expected discounted reward subject to the hard activation constraint $\sum_{i=1}^{N} u_i(t) = M$ at every decision time $t$.
- Relaxed: the same problem with the activation constraint relaxed to hold only in discounted expectation: $\mathbb{E}\left[ \sum_{t=0}^{\infty} \beta^t \sum_{i=1}^{N} u_i(t) \right] = \frac{M}{1 - \beta}$.
- Lagrange: the Lagrange Dual Function is given by:
$$L(\lambda) = \max_{\pi} \; \mathbb{E}\left[ \sum_{t=0}^{\infty} \beta^t \sum_{i=1}^{N} \Big( r_i\big(x_i(t), u_i(t)\big) - \lambda \, u_i(t) \Big) \right] + \frac{\lambda M}{1 - \beta}$$
Notice that we can decouple this problem into $N$ independent single-bandit problems and neglect the last term (a constant).
【Decoupled Problem】(Similar to Gittins) For each bandit $i$ under playing charge $\lambda$:
$$V_i(x_i, \lambda) = \max_{\pi_i} \; \mathbb{E}\left[ \sum_{t=0}^{\infty} \beta^t \Big( r_i\big(x_i(t), u_i(t)\big) - \lambda \, u_i(t) \Big) \,\middle|\, x_i(0) = x_i \right]$$
- Solution to the decoupled problem
  - The optimal policy for the Decoupled Problem may NOT be a stopping rule. In general, the optimal policy divides the state space into two subsets.
  - Let $D_i(\lambda)$ be the set of ALL states for which it is optimal to idle when the playing charge is $\lambda$.
  - The set $D_i(\lambda)$ is characterized by the solution of the Decoupled Problem.
  - Optimal Policy: play if $x_i \notin D_i(\lambda)$; stop otherwise (see the sketch below).
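A sketch of solving the Decoupled Problem for a finite-state restless bandit by value iteration, assuming hypothetical per-action transition matrices `P0`, `P1` and reward vectors `r0`, `r1`:

```python
import numpy as np

def decoupled_solve(P0, P1, r0, r1, lam, beta=0.9, iters=500):
    """Value iteration for one restless bandit under playing charge lam:
    V(x) = max( r0(x)       + beta * P0 V,    # idle (u = 0)
                r1(x) - lam + beta * P1 V )   # play (u = 1), paying the charge
    Returns V and the passive set D(lam) = {x : idling is optimal}."""
    P0, P1 = np.asarray(P0, dtype=float), np.asarray(P1, dtype=float)
    r0, r1 = np.asarray(r0, dtype=float), np.asarray(r1, dtype=float)
    V = np.zeros(len(r0))
    for _ in range(iters):
        q_idle = r0 + beta * P0 @ V
        q_play = r1 - lam + beta * P1 @ V
        V = np.maximum(q_idle, q_play)
    return V, q_idle >= q_play  # D(lam): states where it is optimal to idle
```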
Indexability
The Decoupled Problem associated with bandit $i$ is indexable if the passive set $D_i(\lambda)$ increases monotonically from $\emptyset$ to the whole state space $E_i$ as the charge $\lambda$ increases from $-\infty$ to $+\infty$.
This means that if a bandit is rested under charge $\lambda$, it is also rested under any charge $\lambda' > \lambda$.
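Since indexability is usually hard to establish analytically, one can at least sanity-check it numerically on a grid of charges; a heuristic sketch reusing `decoupled_solve` from above:

```python
import numpy as np

def check_indexability(P0, P1, r0, r1, lams, beta=0.9):
    """Check that the passive set D(lam) only grows along an
    increasing grid of charges `lams` (necessary, not sufficient)."""
    prev = None
    for lam in lams:
        _, passive = decoupled_solve(P0, P1, r0, r1, lam, beta)
        if prev is not None and np.any(prev & ~passive):
            return False  # a state left D(lam) as the charge grew
        prev = passive
    return True
```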
WI policy
Consider the Decoupled Problem and denote by $W_i(x_i)$ the Whittle Index of bandit $i$ in state $x_i$: the charge that makes playing and idling equally attractive, i.e. $W_i(x_i) = \inf\{\lambda : x_i \in D_i(\lambda)\}$.
Taking the activation constraint of the Original Problem into account, we obtain the following WI policy:
【Whittle Index Policy】 At every decision time $t$, activate the $M$ bandits with the highest Whittle Indices $W_i(x_i(t))$.
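Under indexability the passive set is monotone in $\lambda$, so bisection on the charge recovers the Whittle Index; a sketch reusing `decoupled_solve`, with a deliberately conservative bracket:

```python
import numpy as np

def whittle_index(P0, P1, r0, r1, x, beta=0.9, tol=1e-6):
    """Smallest charge lam for which idling is optimal in state x
    (bisection is valid when the problem is indexable)."""
    r0, r1 = np.asarray(r0, dtype=float), np.asarray(r1, dtype=float)
    span = max(r0.max(), r1.max()) - min(r0.min(), r1.min())
    lo, hi = -span / (1 - beta), span / (1 - beta)  # conservative bracket
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        _, passive = decoupled_solve(P0, P1, r0, r1, lam, beta)
        if passive[x]:
            hi = lam  # already idle: the fair charge is at most lam
        else:
            lo = lam  # still playing: the fair charge is above lam
    return 0.5 * (lo + hi)

def whittle_policy(indices, M):
    """Activate the M bandits with the highest current Whittle Indices."""
    return np.argsort(indices)[-M:]
```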
- The Index Policy is a low-complexity heuristic.
- Challenges
  - The Index Policy is only defined for problems that are indexable, a condition that is often difficult to establish.
  - Moreover, it is often hard to find a closed-form expression for $W_i(x_i)$.
- If our RMAB problem is actually a MAB (passive bandits are frozen and earn nothing), then Whittle $=$ Gittins. Thus, in this case, the Whittle Index Policy is optimal.